Currently, data on perceived satisfaction in the university environment are gathered through 
surveys with categorical questions on a Likert scale, covering predefined factors and 
applied once per academic semester, which yields highly biased information. This raises 
the question of why the survey is applied only once and why it asks about only some 
factors. The objective of this article is to demonstrate the feasibility of a proposal to 
determine the degree of perceived student satisfaction through data science and natural 
language processing (NLP), using the social network Twitter as the means of data 
collection. Applying this proposal, student satisfaction was found to be 57.27%, obtained 
through sentiment analysis with the Python library "NLTK". It was also possible to extract 
the texts linked to the factors of teaching performance that are relevant to student 
satisfaction, using the term frequency-inverse document frequency (TF-IDF) approach; 
these factors are linked to the use of simulation tools in the virtual learning process. 

1. INTRODUCTION 


In recent years, technological advances have enabled the generation of, and access to, large volumes of 
data in different online environments, among the most relevant being social networks such as Facebook, 
Twitter or WhatsApp. As a result, new paradigms are needed that allow a better understanding of the 
dynamics of these transformations, both ontological and epistemological [1], [2]. In this context, the 
discipline of global studies originated with the purpose of better understanding the impact that globalization 
trends have on the different facets of our environment [3]. From this perspective, the scope and limits of new 
methodological tools based on data science, and their possible impact on the discipline of global studies, are 
explored [4]. Indeed, one of the most important transformations today is the exponential growth of data, 
which opens the possibility of new ways to analyze reality [5]. Thus, the value of information no longer 
resides in individual data points, but in the way massive data are correlated to discover hidden patterns, 
trends and associations [6]. Until now, big data analysis has been used in different disciplines of scientific 
research to solve complex problems in the fields of environment, education, health, transportation, national 
security and biomedicine [7]. 

Data science is made up of three areas: big data, which is used to process data; data mining, whose 
purpose is to find patterns; and data visualization, whose purpose is to make information clear and easy to 
understand and to promote its dissemination [8]. Data mining is considered to be the analysis of observed 
data sets, generally of large volume, with the aim of finding new relationships between variables and 
summarizing those data sets in a way that is understandable and useful [9]. For this reason, data mining is 
considered a fundamental step in the discovery of information patterns [10]. Data mining encompasses a 
series of modeling techniques oriented to different organizational requirements; in the educational sector, the 
use of these techniques can be seen in a series of works aimed at improving the quality of university 
education [11], [12]. Within data mining there is text mining, which is the process of finding implicit, 
previously unknown patterns that can be useful in a text repository [13]. Text mining searches for regularities 
or patterns in a text using machine learning techniques, and is therefore considered one of the many branches 
of computational linguistics [14]. 

Text mining is embedded in the techniques of natural language processing (NLP), a field of 
computer science, artificial intelligence and linguistics that studies the interactions between computers and 
human language through syntactic, semantic, pragmatic and morphological analysis. These rules, in 
combination with the information stored in computational dictionaries, define the patterns to be recognized 
in images, text and voice [15], [16]. Thanks to the expansion of technology, in recent years there has been 
exponential growth in the subjective information available on the Internet. This phenomenon has sparked 
interest in sentiment analysis (SA), an NLP task that identifies opinions related to an object [17]. All of this 
provides richer information to the educational sector in its search to improve the quality of its service, 
because nowadays communication and interaction between students and teachers take place through social 
networks [18], [19]. The quality of the educational service has become a social requirement based on 
educational policy implemented at the international level; globalization leads us to think that quality 
standards are established internationally based on indicators of excellence in academic productivity, which 
are the results of scientific and technological research [20]. Accordingly, a crucial ingredient for achieving 
quality is a strong institutional capacity to adapt proactively to new requirements and changes in the 
environment; that is, to function as a social and organizational system that combines homogeneity and 
homeostasis in a state of equilibrium, applying a series of self-regulated processes related to organizational 
properties [21], [22]. 

In addition to the functions inherent to their mission, organizations play an important role in 
promoting efforts to improve quality in higher education through the transfer of knowledge and experience of 
quality processes and the practices to implement them [23]. One of the universal principles of quality 
management is customer focus. Educational institutions, in this search for opportunities to improve, have 
been identifying models to evaluate student satisfaction in line with trends in quality management and 
performance excellence [24]. As indicated in [25], knowing the extent of students' satisfaction with the 
institution they attend allows both positive and negative aspects to be identified, the latter being fundamental 
when determining strategies to improve educational quality. Analyzing student satisfaction is complex, since 
multiple factors influence it, such as personal, social, academic, institutional or technological factors, among 
others [26]. According to [27], the key factor is the ability of teachers to respond to the expectations and 
needs of the students. As indicated in [28], the performance of the university teacher is fundamental to the 
assessment of the quality of the educational service. Student satisfaction is usually measured with traditional 
tools and instruments such as student surveys and questionnaires. Nowadays, in the context of virtualization, 
new methods and procedures are sought in order to facilitate the use of digital tools [29]. This makes it 
possible to take advantage of the benefits of social networks and their relationship with self-learning, 
collaborative work and critical thinking, considering in addition that through this medium students can share, 
respond to, comment on and discuss information about the teaching-learning process [30]. 

In this sense, this article aims to identify the factors that influence the satisfaction of university 
students with their learning in virtual environments, through sentiment analysis. In this way, it contributes to 
improving the quality of the university service with respect to teaching performance, a factor which, 
according to the student perception obtained annually through traditional questionnaires, needs to improve. 
Likewise, big data analysis in the field of education is still little explored, which is why this article 
contributes to the debate on what data science has to offer. 


2. LITERATURE REVIEW 

Analyzing text generated in tweets and other social networks is more complicated than analyzing 
text from more formal sources, or text that passes through a filter before being published, because of the 
freedom with which people write: the use of abbreviations, incorrect spelling, country-specific terms and the 
inclusion of special characters such as symbols, URLs and retweet entities, among others [31]. In this regard, 
the author of [32] concludes that extraction tools are very useful for collecting data from the web, but that 
their use depends heavily on the structure of each web page and its data; in the case of Twitter, it must be 
taken into account that the data are presented as the page progresses. Likewise, an investigation of opinion 
mining on Twitter concludes that a system for analyzing opinions on this social network was successfully 
developed, which extracts tweets, lemmatizes tokens, classifies and analyzes corpora and, finally, displays 
the results [33]. 
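
The noise sources listed above (URLs, retweet markers, symbols) can be illustrated with a short cleaning routine. This is only a minimal sketch using Python's standard "re" module; the function name strip_tweet_noise and the specific patterns are assumptions made for illustration and are not taken from the cited works or from this study.

```python
import re

# Hypothetical patterns for common tweet noise (illustrative, not exhaustive)
URL_RE = re.compile(r"https?://\S+|www\.\S+")
RETWEET_RE = re.compile(r"^\s*RT\b:?\s*", flags=re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")
SYMBOL_RE = re.compile(r"[^\w\s]")


def strip_tweet_noise(text: str) -> str:
    """Remove URLs, a leading retweet marker, user mentions and symbols."""
    text = URL_RE.sub(" ", text)
    text = RETWEET_RE.sub("", text)
    text = MENTION_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions
    return " ".join(text.split())


example = strip_tweet_noise("RT @user: Great class!! https://t.co/abc #PID")
```

The order of the substitutions matters: URLs are removed before symbols so that the punctuation inside a link does not leave fragments behind.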

Sources such as Twitter are recommended for building opinion databases for two reasons: first, 
because the type of data (text) lends itself to analysis through NLP, unlike other networks whose content 
consists of unstructured data such as images or videos; and second, because the volume of data generated 
makes it possible to reach the minimum levels needed to perform complex analyses [16]. Along the same 
line of research, [14] indicates that new tools have been created to facilitate access to the large amount of 
information generated daily. One of the most widely used at the organizational level is text mining, which 
offers the educational organization the possibility of exploring large amounts of text not organized in the 
form of data, establishing patterns and extracting knowledge useful for educational management. 


3. METHOD 
3.1. Description of the data collected 

The unit of analysis of this research is the comments or opinions expressed on the social network 
Twitter by the students enrolled in the automatic process control course of the professional school of 
mechanical and electrical engineering of the National Technological University of Lima Sur (UNTELS). 
Data acquisition followed a non-experimental longitudinal design: it is longitudinal because the data are 
acquired in each class session, over a collection period running from week 9 to week 13, and it is 
non-experimental because no action is exercised on the students that would alter or modify their comments, 
so the data are treated in their natural condition. 

As a pilot test, the research project takes those 5 weeks as the data collection period, in order to 
obtain a first result and then apply this model to the analysis of the sixteen weeks of class that make up an 
academic cycle in the subjects of professional engineering schools in Peru. Figure 1 shows the number of 
interactions generated in the Twitter account during the study period. The total impressions obtained in the 
5 weeks of data acquisition were 455, and the number of interactions or comments in the form of tweets 
was 220. 

Figure 1. Interactions in the Twitter account in the study period 


3.2. Description of pre-processing, sentiment polarity analysis and data extraction 
Figure 2 shows the model used to identify the relevant factors of university student satisfaction, 
based on sentiment analysis and the TF-IDF approach (term frequency and inverse document frequency). 
According to the model, the data provided by the social network Twitter as a file with the "csv" extension 
are conditioned in order to obtain the polarity of the comments through sentiment analysis. The conditioning 
or pre-processing consists of eliminating repeated texts, isolated individual characters and extra blank spaces 
between words, as well as converting all texts to lower case. 
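
The conditioning steps just described can be sketched as follows. This is an illustrative sketch only, assuming the comments are available as a list of strings; the function name precondition and the regular expressions are hypothetical and are not the script used in the study.

```python
import re


def precondition(comments):
    """Apply the four conditioning steps to a list of comment strings."""
    seen = set()
    cleaned = []
    for text in comments:
        text = text.lower()                            # convert to lower case
        text = re.sub(r"(?<!\S)[a-z](?!\S)", " ", text)  # drop isolated single letters
        text = re.sub(r"\s+", " ", text).strip()        # collapse extra blank spaces
        if text and text not in seen:                  # eliminate repeated texts
            seen.add(text)
            cleaned.append(text)
    return cleaned


sample = ["The  PID   controller x works", "the pid controller works", "Good class"]
conditioned = precondition(sample)
```

Duplicates are detected after normalization, so two comments that differ only in case or spacing count as the same text. Note that dropping isolated letters also removes the English words "a" and "I", which may or may not be desirable.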
Figure 2. Model used to identify relevant factors of university student satisfaction 


Next, sentiment analysis of the tweets written in English is carried out using Python libraries such 
as "nltk" and "vader". The result is a quantification of the sentiment contained in each comment or opinion 
written by a student. The comments are then classified as having positive, neutral or negative polarity, which 
constitutes output 1 of the model and provides relevant information for decision-making to improve the 
teaching-learning process of the process control subject. Table 1 specifies the ranges in which the "nltk" and 
"vader" libraries classify the sentiment contained in each tweet according to its polarity. 


Table 1. Classification of sentiments according to their polarity through NLTK and VADER 
Sentiment polarity    Positive          Neutral       Negative 
Sentiment score       Greater than 0    Equal to 0    Less than 0 
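
The polarity assignment can be sketched in a few lines. The sketch assumes that the score being classified is VADER's compound score, which lies between -1 and 1, and that zero is the cut-off between classes; this assumption is consistent with the scores reported for the first five tweets in Table 2, which are reused here as sample input. The function name polarity_label is hypothetical.

```python
def polarity_label(compound: float) -> str:
    """Map a compound sentiment score to a polarity class (zero cut-off assumed)."""
    if compound > 0:
        return "positive"
    if compound < 0:
        return "negative"
    return "neutral"


# Sentiment values of the first five tweets, as listed in Table 2
scores = [0.4404, 0.2960, 0.4019, 0.0000, -0.2924]
labels = [polarity_label(score) for score in scores]
```

A stricter variant would treat a small band around zero (for example, scores between -0.05 and 0.05) as neutral; whether that is appropriate depends on how much weakly polarized text the corpus contains.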


Likewise, the last stage of the proposed model is the extraction of texts through the TF-IDF 
approach, where TF represents the term frequency and IDF the inverse document frequency. This indicator 
(TF-IDF) makes it possible to select the texts that appear most often in an individual comment or opinion 
and, at the same time, least often across all comments. Through this stage, the model identifies the relevant 
factors in university satisfaction with teaching performance in virtual teaching. This second output 
(output 2) of the proposed model also provides the TF-IDF value, which represents the weight of a text "i" 
in a tweet "j". Equation (1) shows the relationship between TF-IDF, TF and IDF for the comments obtained 
through the social network Twitter, where "i" indexes the texts in a tweet, "j" indexes the tweets obtained 
during the data collection period, Te_i represents text "i" and Tw_j represents tweet "j". 

$$\text{TF-IDF}(Te_i, Tw_j) = \text{TF}(Te_i, Tw_j) \times \text{IDF}(Te_i, Tw_j) \tag{1}$$
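
To make (1) concrete, the following sketch computes the weights by hand for a toy corpus. It assumes the textbook definitions, TF as the relative frequency of a text within a tweet and IDF as the natural logarithm of the number of tweets divided by the number of tweets containing the text; library implementations such as scikit-learn apply smoothing and normalization, so their values differ. The function name tf_idf and the sample tweets are illustrative only.

```python
import math
from collections import Counter


def tf_idf(tweets):
    """Return one dict per tweet mapping each text (word) to its TF-IDF weight."""
    tokenized = [tweet.lower().split() for tweet in tweets]
    n_tweets = len(tokenized)
    # Number of tweets in which each word appears at least once
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_tweets / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


weights = tf_idf(["simulink helps", "class helps"])
```

In this toy corpus "helps" occurs in every tweet, so its IDF is zero and it receives no weight, whereas "simulink" occurs in only one tweet and is weighted accordingly.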


3.3. Development of the proposed model through Python in the JupyterLab environment 

Following the proposed model, Figure 3 shows the code that loads into the JupyterLab environment 
the file provided by Twitter, called "tweets_process_control_29Dic21.csv". The "pandas" library is imported 
to read and write data in different formats or extensions; in this investigation the file has a "csv" extension. 
The "numpy" library is used to create the vectors or matrices that link the texts with their corresponding 
sentiment and polarity. 


import numpy as np 
import pandas as pd 

# Load the CSV file exported from Twitter 
df = pd.read_csv("tweets_process_control_29Dic21.csv") 


Figure 3. Script for reading twitter data in Python via JupyterLab 


Once the data have been read, the texts contained in the comments on the Twitter account are 
conditioned or pre-processed, making use of the Python module "nltk.sentiment.vader", its 
"SentimentIntensityAnalyzer" class and the "re" module, which are used to condition the text and then 
determine the level of sentiment. Once the texts had been conditioned and standardized, the sentiment 
contained in each of the tweets written by the students of the process control subject was determined. 
Figure 4 shows the code, in which a column called "sentiment" is generated to the right of the texts to store 
the sentiment score of each tweet. The resulting data are saved to the file "sentiments_data.csv". As 
evidence of the result, Table 2 shows the sentiment value of the first 5 tweets, stored in the "sentiment" 
column. From these results it was possible to assign the corresponding polarity to each tweet; at this stage, 
the satisfaction results are obtained by comparing the number of comments with positive polarity and with 
negative polarity. 


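import pandas as pd 
from nltk.sentiment.vader import SentimentIntensityAnalyzer 

df = pd.read_csv("tweets_process_control_29Dic21.csv") 
sid = SentimentIntensityAnalyzer() 
# Store the VADER compound score of each comment in a new "sentiment" column 
df["sentiment"] = df["Comments on the satisfaction of the teaching performance"].apply(lambda text: sid.polarity_scores(text)["compound"]) 
df.to_csv("sentiments_data.csv") 
df.head() 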


Figure 4. Script to obtain the sentiments contained in each Tweet 


Table 2. Result of the sentiments contained in each tweet 
Tweet number   Comments on the satisfaction of the teaching performance   Sentiment 
0              It has been a good way to deepen the topic sho...          0.4404 
1              I would have liked the problems proposed in the...         0.2960 
2              I found it interesting that the maximum oversh...          0.4019 
3              The statements are the same as those learned ...           0.0000 
4              There are topics that were not clear in other ...          -0.2924 

Following the stages described in the proposed model, Figure 5 shows the code that extracts the most 
relevant texts contained in the tweets whose polarity is positive or negative, since these are closely related to 
satisfaction and dissatisfaction with teaching performance. The code makes use of the "stopwords" and 
"brown" corpora imported from nltk.corpus, which support text processing. The "stopwords" corpus contains 
a group of words that carry little meaning on their own, so removing them helps to extract the relevant 
"words" correctly. The "brown" corpus contains a representative set of English words that can be used, by 
comparison, to extract the words that have semantic and linguistic sense from a document. 


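from nltk.corpus import stopwords 
from nltk.corpus import brown 
from sklearn.feature_extraction.text import TfidfVectorizer 

# Keep the 10 most frequent terms that occur in at least 10 tweets and in at most 70% of them 
vectorizer = TfidfVectorizer(max_features=10, min_df=10, max_df=0.7, stop_words=stopwords.words('english')) 
# processed_features holds the conditioned tweet texts and is replaced by their TF-IDF matrix 
processed_features = vectorizer.fit_transform(processed_features).toarray() 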


Figure 5. Script to perform word extraction, based on the TF-IDF approach 


As mentioned in the description of the proposed model, the approach used for text extraction is 
TF-IDF, so the "TfidfVectorizer" class is used. This class allows setting the parameter "max_features", which 
represents the number of most frequent "words" to be extracted from all the tweets with positive and 
negative polarity generated by the students. The parameter "min_df" establishes the minimum number of 
tweets in which an extracted word must be found, while the parameter "max_df" represents the maximum 
percentage of tweets in which an extracted word may be found. 
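
The combined effect of the three parameters can be illustrated without any external library. The sketch below mimics, in simplified form, how a vocabulary is filtered by document frequency and then truncated to the most frequent terms; it is not scikit-learn's implementation, and the function name select_vocabulary and the toy data are assumptions made for illustration.

```python
from collections import Counter


def select_vocabulary(tokenized_tweets, max_features=10, min_df=10, max_df=0.7):
    """Keep words found in at least min_df tweets and at most max_df (a fraction)
    of all tweets, then return the max_features most frequent ones."""
    n_tweets = len(tokenized_tweets)
    doc_freq = Counter(w for tokens in tokenized_tweets for w in set(tokens))
    term_freq = Counter(w for tokens in tokenized_tweets for w in tokens)
    eligible = [w for w in term_freq if min_df <= doc_freq[w] <= max_df * n_tweets]
    # Most frequent first; ties broken alphabetically for a stable result
    eligible.sort(key=lambda w: (-term_freq[w], w))
    return eligible[:max_features]


toy = [["pid", "class"], ["pid", "lab"], ["pid", "class"], ["pid", "simulink"]]
vocabulary = select_vocabulary(toy, max_features=1, min_df=2, max_df=0.7)
```

In the toy data, "pid" appears in every tweet and is discarded by "max_df", "lab" and "simulink" appear only once and are discarded by "min_df", and only "class" survives.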


4. RESULTS 

The objective of the research is to describe satisfaction with virtual learning based on sentiment 
analysis of the comments or opinions of university students expressed through the social network Twitter. 
Table 3 shows the polarity results obtained through Python and its libraries linked with the Natural 
Language Toolkit (NLTK). These results show that it is possible to obtain a benchmark of student 
satisfaction using text mining and natural language processing (NLP) techniques. 


Table 3. Distribution of the polarity of the sentiments contained in the tweets 
                   Positive polarization   Negative polarization   Neutral polarization 
Number of tweets   126                     60                      34 
Percent            57.27%                  27.27%                  15.45% 
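
The percentages in Table 3 follow directly from the tweet counts, as the short check below shows. The counts are taken from the table; the variable names are illustrative.

```python
# Tweet counts per polarity, as reported in Table 3
counts = {"positive": 126, "negative": 60, "neutral": 34}

total = sum(counts.values())  # 220 tweets collected in the study period
percentages = {
    label: round(100 * count / total, 2) for label, count in counts.items()
}
```

The positive share, 126 out of 220 tweets, is the 57.27% reported as the level of student satisfaction.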


Another research objective is to determine the relevant factors of satisfaction with virtual learning, 
based on the extraction of predominant texts under the TF-IDF approach, for which the following results are 
described. Figure 6 shows the results of the word extraction, for which, in order to improve the performance 
and precision of the extraction, "max_features" was set to 10, "min_df" to 10 and "max_df" to 0.7. In other 
words, the 10 most frequent words are extracted, provided that they appear in at least 10 tweets and in at 
most 70% of all tweets. Figure 6 also shows the IDF value of each extracted word, which reflects its weight 
with respect to a document or tweet. The terms with the highest IDF were "simulink" and "pid", with a value 
of 3.1361. This result makes it possible to establish that the relevant factors that contribute significantly to 
satisfaction with virtual learning are those related to teaching performance in class sessions that make the 
laboratory sessions understandable through the use of simulation tools such as Simulink and PID 
(proportional-integral-derivative) controllers. 
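
As an illustrative check of how an IDF value of this magnitude can arise, the sketch below evaluates the smoothed IDF formula that scikit-learn applies by default, ln((1 + n) / (1 + df)) + 1. It assumes that the vectorizer was fitted with default settings on the 126 positive tweets of Table 3; the document frequency of 14 tweets is an inference from the reported value, not a figure stated in the study.

```python
import math


def smoothed_idf(n_documents: int, document_frequency: int) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


# Assumed inputs: 126 positive tweets, term present in 14 of them
idf_value = smoothed_idf(126, 14)  # approximately 3.1361
```

Under these assumptions, a higher IDF simply means that the term appears in fewer tweets, so the highest reported IDF corresponds to terms close to the "min_df" threshold of 10 tweets.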

Considering now the tweets with negative polarity, Figure 7 shows the results of the word extraction 
with "max_features" equal to 10, "min_df" equal to 10 and "max_df" equal to 0.7. In other words, the 10 most 
frequent words are extracted, provided that they appear in at least 10 tweets and in at most 70% of all tweets, 
in order to maintain the same level of performance and precision in the word extraction process. These 
extracted texts make it possible to establish that the relevant factors that generated dissatisfaction with 
virtual learning were linked to the laboratory sessions in which graphical representations and block diagrams 
were used to explain the class session; evidently, the students were not able to understand the use of these 
tools in a meaningful way. 


print(vectorizer.vocabulary_) 
{'class': 1, 'interesting': 2, 'laboratory': 3, 'understand': 7, 'able': 0, 'us': 8, 'teacher': 6, 'use': 9, 'simulink': 5, 'pid': 4} 
print(vectorizer.idf_) 
[2.28883902 2.54835022 2.13061502 2.54835022 2.79966465 2.70869287 
 2.89974811 2.89974811 3.13613689 3.13613689] 
print(processed_features) 

Figure 7. Word extraction from tweets with negative polarity 


5. DISCUSSION 

Regarding the degree of university student satisfaction with teaching performance in the virtual 
teaching-learning environment, based on sentiment analysis of opinions collected from the social network 
Twitter, it was possible to assess the degree of satisfaction by associating it with positively polarized 
tweets, which gave a result of 57.27%. Today, however, data on satisfaction in the university environment 
are generated by surveys with categorical questions on a Likert scale, with predefined indicators to be 
assessed, applied once per academic semester, which produces highly biased results. This is where the 
relevance of data science, data mining and natural language processing lies: with them it is possible to 
generate satisfaction results in real time from open comments and free opinions in each class session, and 
not just once or twice a year as is traditionally done. This study seeks to identify which factors the teacher 
should improve in order to have a preponderant effect on student satisfaction and academic quality. 

In this regard, the authors of [34] assess the perception of university students about academic 
activities using sentiment analysis techniques. Rather than focusing on the satisfaction result they obtained, 
I emphasize that their data collection instrument is a survey with 3 categorical questions and 4 open 
questions. I disagree with this method of assessing satisfaction: first, the survey presents predefined factors 
to be evaluated, which raises the question of how these factors were identified; second, the survey is applied 
on a single date, which raises the question of how the moment at which it should be applied was determined. 
Both limitations can be overcome with social networks such as Twitter, where open opinion data covering a 
wide range of elements that the student considers relevant can be collected in each class session throughout 
the academic semester and processed in real time. Likewise, the author of [35] seeks to determine the level 
of satisfaction with online teaching using sentiment analysis, with a survey as the data collection technique. 
This again shows that many studies focus on the result without considering the disadvantages or weaknesses 
of traditional methods, or how data science and natural language processing can transform the way the 
perception of satisfaction is determined in university institutions today. 

Regarding the identification of the relevant factors, the TF-IDF approach for extracting predominant 
texts was applied to the comments or opinions expressed by the students during different class sessions 
through the social network Twitter. The most relevant factors identified are linked to the use of simulation 
software for understanding the theoretical aspects, which is understandable, since students of the school of 
mechanical and electrical engineering need to make use of these tools. Because of the health emergency, all 
courses from cycle I to cycle X are taught virtually, and not all teachers use simulation software. This is why 
the paradigm of satisfaction assessment based on the narrowly delimited questions of a survey needs to 
change; instead, techniques should be sought to process the texts or comments freely issued by students, 
through data science. In this regard, the authors of [36], [37] use the social network Twitter for data 
collection and, through sentiment analysis, evaluate satisfaction in universities; they also extract texts in 
order to identify relevant factors in the evaluation process and achieve automated analysis. These works are 
therefore close to the approach proposed in my research, agreeing on the method and technique of data 
processing. 


6. CONCLUSION 

In this article, the factors that influence university students' satisfaction with their learning in virtual 
environments were identified through sentiment analysis and the TF-IDF approach, showing that it is 
possible to obtain a reference point on student satisfaction using text mining and natural language processing 
(NLP) techniques. The relevant factors that contribute significantly to satisfaction with virtual learning were 
found to be those related to teaching performance in the use of simulation tools that make the class sessions 
understandable. It is recommended to implement data science techniques and sentiment analysis based on 
social networks to determine university student satisfaction with teacher performance, so that higher 
education institutions can make timely decisions within the same academic semester rather than after it ends, 
as is currently done in many universities. For future work, it is recommended to expand the line of research 
by developing a predictive model of student satisfaction with respect to the development of the teacher. 
Likewise, the results obtained could be compared with those obtained through traditional techniques and 
instruments for evaluating satisfaction with teacher performance, in order to establish the most suitable 
model for measuring student satisfaction. 